Available online at www.sciencedirect.com 


Molecular Phylogenetics and Evolution 36 (2005) 224-232


www.elsevier.com/locate/ympev 


Coronavirus phylogeny based on a geometric approach 


Wen-Xin Zheng a, Ling-Ling Chen a,b, Hong-Yu Ou a, Feng Gao a, Chun-Ting Zhang a,*


a Department of Physics, Tianjin University, Tianjin 300072, China
b Laboratory for Computational Biology, Shandong Provincial Research Center for Bioinformatic Engineering and Technique,
Shandong University of Technology, Zibo 255049, China


Received 24 May 2004; revised 12 January 2005 
Available online 10 May 2005 


Abstract 


A novel coronavirus has been identified as the cause of the outbreak of severe acute respiratory syndrome (SARS).
Previous phylogenetic analyses based on sequence alignments show that SARS-CoVs form a new group distantly related
to the other three groups of previously characterized coronaviruses. In this paper, a geometric approach based on
the Z-curve representation of the whole genome sequence is proposed to analyze the phylogenetic relationships of
coronaviruses. The evolutionary distances are obtained by measuring the differences among the three-dimensional
Z-curves. Each Z-curve is approximately described by its geometric center and the three associated eigenvectors,
which indicate the center position and the trend of the Z-curve, respectively. Although some information is lost
due to this approximate description, the phylogenetic tree constructed from these parameters is consistent with
those of previous analyses. The present method has the merits of simplicity and intuitiveness, but it is still at
an early stage of development. Because the phylogenetic relationships are inferred from the whole genome, instead
of from individual genes, the present method represents a new direction for phylogenetic study in the post-genome era.

1. Introduction

The outbreak of atypical pneumonia referred to as severe acute respiratory syndrome (SARS) was first
identified in Guangdong Province, China, and later spread to several countries (Drosten et al., 2003;
Ksiazek et al., 2003; Lee et al., 2003; Peiris et al., 2003; Poutanen et al., 2003; Tsang et al., 2003).
A novel coronavirus was isolated and found to be the cause of SARS. Although SARS has been brought under
control, scattered cases of SARS-CoV infection have since been reported. No effective drugs are currently
available to cure this disease. Gaining insight into the phylogenetic relationships among coronaviruses
would be helpful for discovering drugs and developing vaccines against the virus.

The SARS coronavirus is a new member of the order Nidovirales, family Coronaviridae, genus Coronavirus.
Coronaviruses are a diverse group of large, enveloped, positive-stranded RNA viruses that cause respiratory
and enteric diseases in humans and other animals (Rota et al., 2003). Excluding SARS-CoVs, coronaviruses can
be divided into three groups according to serotype. Groups I and II contain mammalian viruses; group II
coronaviruses additionally contain a hemagglutinin esterase gene homologous to that of Influenza C virus
(Lai and Holmes, 2001). Group III contains only avian viruses. Previous work showed that SARS-CoVs are not
closely related to any of the previously characterized coronaviruses and form a distinct group (group IV)
within the genus Coronavirus (Marra et al., 2003; Rota et al., 2003).

An intuitive method is proposed in this article to infer the phylogenetic relationships of coronaviruses.
Historically, Cork et al. proposed a three-dimensional representation of genomic sequences, called the
W-curve (Wu et al., 1993). Since then, the W-curve has been used to analyze genomic sequences and to study
the phylogeny of bacteria (Cork, 2003; Cork et al., 2002; Cork and Toguem, 2002). Instead of sequence
alignment, we adopt a geometric method based on the Z-curve of the whole genome. The Z-curve is a
three-dimensional space curve that constitutes a unique representation of a given DNA sequence, in the sense
that each can be reconstructed from the other (Zhang and Zhang, 1991, 1994). Based on the Z-curve method,
a coronavirus-specific gene-finding system, ZCURVE_CoV, has been developed (Chen et al., 2003); the software
is especially suitable for gene recognition in SARS-CoV genomes. The system was further improved by taking
the prediction of cleavage sites of viral proteinases in polyproteins into consideration (Gao et al., 2003).
Here we use the differences between the three-dimensional space curves as the foundation for deriving the
phylogeny of coronaviruses. The key problems are which parameters should be used to describe a curve and how
to determine evolutionary distances among organisms from a group of curves. In this paper, we use a set of
parameters, namely the geometric center and the covariance matrix, to reflect the center position and the
distribution pattern of a curve, respectively. The result shows that SARS-CoVs form an independent group,
which is consistent with previous analyses.


2. Materials and methods 
2.1. Materials 


The 24 complete coronavirus genomes used in this 
paper were downloaded from GenBank, of which 12 
are SARS-CoVs and 12 are from other groups of coro- 
naviruses. The name, accession number, abbreviation, 
and genome length for the 24 genomes are listed in 
Table 1. According to the existing taxonomic groups, 
sequences 1-3 belong to group I, and sequences 4-11 
are members of group II, while sequence 12 is the only 
representative of group III. Refer to Table 1 for details. 


Table 1
The accession number, abbreviation, name, and length for each of the 24 coronavirus genomes

No.  Accession   Group  Abbreviation  Genome                                    Length (nt)
1    NC_002645   I      HCoV-229E     Human coronavirus 229E                    27,317
2    NC_002306   I      TGEV          Transmissible gastroenteritis virus       28,586
3    NC_003436   I      PEDV          Porcine epidemic diarrhea virus           28,033
4    U00735      II     BCoVM         Bovine coronavirus strain Mebus           31,032
5    AF391542    II     BCoVL         Bovine coronavirus isolate BCoV-LUN       31,028
6    AF220295    II     BCoVQ         Bovine coronavirus strain Quebec          31,100
7    NC_003045   II     BCoV          Bovine coronavirus                        31,028
8    AF208067    II     MHVM          Murine hepatitis virus strain ML-10       31,233
9    AF201929    II     MHV2          Murine hepatitis virus strain 2           31,276
10   AF208066    II     MHVP          Murine hepatitis virus strain Penn 97-1   (illegible)
11   NC_001846   II     MHV           Murine hepatitis virus                    31,357
12   NC_001451   III    IBV           Avian infectious bronchitis virus         27,608
13   AY278488    IV     BJ01          SARS coronavirus BJ01                     29,725
14   AY278741    IV     Urbani        SARS coronavirus Urbani                   29,727
15   AY278491    IV     HKU-39849     SARS coronavirus HKU-39849                29,742
16   AY278554    IV     CUHK-W1       SARS coronavirus CUHK-W1                  29,736
17   AY282752    IV     CUHK-Su10     SARS coronavirus CUHK-Su10                29,736
18   AY283794    IV     SIN2500       SARS coronavirus Sin2500                  29,711
19   AY283795    IV     SIN2677       SARS coronavirus Sin2677                  29,705
20   AY283796    IV     SIN2679       SARS coronavirus Sin2679                  29,711
21   AY283797    IV     SIN2748       SARS coronavirus Sin2748                  29,706
22   AY283798    IV     SIN2774       SARS coronavirus Sin2774                  29,711
23   AY291451    IV     TW1           SARS coronavirus TW1                      29,729
24   NC_004718   IV     TOR2          SARS coronavirus                          29,751


2.2. The Z-curve


The Z-curve is a three-dimensional curve that constitutes a unique representation of a given DNA sequence,
in the sense that each can be uniquely reconstructed from the other (Zhang and Zhang, 1991, 1994). The
resulting curve has a zigzag shape, hence the name Z-curve. The Z-curve is briefly presented as follows.
Consider a DNA sequence with N bases, read from the 5'-end to the 3'-end. Beginning from the first base,
inspect the sequence one base at a time. In the nth step, where n = 1, 2, ..., N, count the cumulative
numbers of the bases A, C, G, and T occurring in the subsequence from the first base to the nth base, and
denote them by $A_n$, $C_n$, $G_n$, and $T_n$, respectively. The Z-curve consists of a series of nodes
$P_n$, where n = 1, 2, ..., N, whose coordinates are uniquely determined by the Z-transform of the DNA
sequence (Zhang and Zhang, 1991, 1994):

$$
\begin{aligned}
x_n &= (A_n + G_n) - (C_n + T_n) \equiv R_n - Y_n,\\
y_n &= (A_n + C_n) - (G_n + T_n) \equiv M_n - K_n,\\
z_n &= (A_n + T_n) - (C_n + G_n) \equiv W_n - S_n,\\
n &= 0, 1, \ldots, N, \qquad x_n, y_n, z_n \in [-N, N],
\end{aligned}
\tag{1}
$$

where $A_0 = C_0 = G_0 = T_0 = 0$ and $x_0 = y_0 = z_0 = 0$.
Here R, Y, M, K, W, and S represent puRine, pYrimidine, aMino, Keto, Weak hydrogen bond, and Strong
hydrogen bond bases, respectively, according to the 1984 Recommendations of the NC-IUB (Cornish-Bowden,
1985). The line that connects the nodes $P_0$ ($P_0 = 0$), $P_1$, $P_2$, ..., $P_N$ one by one
sequentially is called the Z-curve of the DNA sequence inspected. The Z-curve defined above is a
three-dimensional space curve with three independent components, i.e., $x_n$, $y_n$, and $z_n$, which
display the distributions of bases of the R/Y, M/K, and W/S types, respectively, along the sequence.
By viewing the Z-curve, some global and local features of the sequence can be detected in a perceivable way.
For almost all genome or chromosome sequences, the curves of $z_n \sim n$ are roughly straight lines
(Zhang et al., 2001). For convenience, the curve of $z_n \sim n$ is fitted by a straight line using the
least-squares technique,

$$ z = kn, \tag{2} $$

where $(z, n)$ is the coordinate of a point on the fitted straight line and $k$ is its slope. Instead of
the curve of $z_n \sim n$, we will use the $z'_n \sim n$ curve hereafter, where

$$ z'_n = z_n - kn. \tag{3} $$
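
As an illustration of Eqs. (1)-(3), the following Python sketch computes the nodes of the Z-curve and the detrended component z'. The function names `z_curve` and `detrend_z` are hypothetical, and the fit is assumed to be a least-squares line through the origin because Eq. (2) contains no intercept term; the original implementation may differ in such details.

```python
def z_curve(seq):
    """Return the lists x, y, z holding the nodes P_0..P_N of Eq. (1)."""
    x, y, z = [0], [0], [0]
    for base in seq.upper():
        if base not in "ACGT":
            raise ValueError(f"unexpected base {base!r}")
        # x: purines (A, G) count +1, pyrimidines (C, T) count -1
        x.append(x[-1] + (1 if base in "AG" else -1))
        # y: amino bases (A, C) count +1, keto bases (G, T) count -1
        y.append(y[-1] + (1 if base in "AC" else -1))
        # z: weak H-bond bases (A, T) count +1, strong ones (C, G) count -1
        z.append(z[-1] + (1 if base in "AT" else -1))
    return x, y, z


def detrend_z(z):
    """Fit z_n ~ k*n through the origin (Eq. 2) and return (k, z') as in Eq. (3)."""
    if len(z) < 2:
        raise ValueError("at least one base is needed to fit the slope")
    positions = range(len(z))
    # Least-squares slope for a line without intercept: k = sum(n*z_n) / sum(n^2)
    k = sum(n * zn for n, zn in zip(positions, z)) / sum(n * n for n in positions)
    return k, [zn - k * n for n, zn in zip(positions, z)]
```

Calling `z_curve("AAGC")` yields x = [0, 1, 2, 3, 2], y = [0, 1, 2, 1, 2], and z = [0, 1, 2, 1, 0]; because each step moves every component by exactly one unit, the original sequence can be read back from the three increments, which is the sense in which the representation is unique.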


2.3. Algorithm 


In this paper, we propose a new way to infer evolutionary distances between organisms from whole
genome sequences. As the Z-curve is a unique representation of a genome, it can be used to reflect a
genome's characteristics (Fig. 1). For convenience, we use the coordinates (x, y, z') rather than
(x, y, z). The differences among the Z-curves of these genomes form the basis for constructing the
phylogenetic tree. The process can be separated into three stages. First, the Z-curve of each genome is
described by a set of parameters; second, the distance matrix is generated from the parameters obtained
in the first stage; and finally, the phylogenetic tree is constructed from the distance matrix.


Fig. 1. The three-dimensional Z-curves (x, y, z') for three complete coronavirus genomes. (A-C) The Z-curves
of BJ01, TOR2, and BCoV, respectively. It can be clearly seen that the Z-curves of BJ01 and TOR2 are very
similar, while the Z-curve of BCoV is significantly different from the former two. This forms the basis of
the present method. (D) A sketch of the three eigenvectors for one genome (TOR2), which illustrates the
relationship between the three eigenvectors and the Z-curve.


Table 2
The geometric center and three eigenvectors of the Z-curve for each of the 24 coronavirus genomes (a)

Abbreviation  x̄        ȳ         z̄'           C_xx     C_xy      C_xz'     C_yx      C_yy     C_yz'     C_z'x     C_z'y     C_z'z'
HCoV-229E     -313.52  -1930.76    22.81       0.80520  -0.16770  -0.56879   0.20016  0.97975  -0.00552   0.55820  -0.10941  0.82246
TGEV           -21.86  -1185.34   -87.64       0.95040  -0.06255  -0.30468   0.06443  0.99791  -0.00388   0.30429  -0.01594  0.95245
PEDV          -733.04  -1930.98  -194.31       0.90732  -0.36443  -0.20968   0.37249  0.92804  -0.00112   0.19500  -0.07709  0.97777
BCoVM         -272.66  -2691.17   -94.48       0.95060  -0.13311  -0.28044   0.13749  0.99049  -0.00411   0.27833  -0.03465  0.95986
BCoVL         -268.51  -2658.04   -95.32       0.95068  -0.13276   0.28034   0.13965  0.99019  -0.00464  -0.27697   0.04356  0.95989
BCoVQ         -257.34  -2710.49   -90.44       0.95950  -0.13084  -0.24949   0.13399  0.99097  -0.00440   0.24781  -0.02921  0.96837
BCoV          -269.55  -2643.97   -95.53       0.97838  -0.13953   0.15268   0.14190  0.98987  -0.00467  -0.15048   0.02624  0.98826
MHVM          -129.63  -2295.56  -438.75       0.96108  -0.06279   0.26906   0.06409  0.99794   0.00395  -0.26875   0.01345  0.96312
MHV2          -184.66  -2375.26  -428.28       0.98451  -0.07886   0.15662   0.07893  0.99686   0.00582  -0.15659   0.00663  0.98764
MHVP          -197.59  -2384.87  -384.67       0.98842  -0.08099   0.12835   0.08133  0.99668   0.00255  -0.12813   0.00792  0.99173
MHV           -124.73  -2284.70  -436.33       0.95624  -0.06424   0.28543   0.06586  0.99782   0.00393  -0.28506   0.01504  0.95839
IBV            142.03  -1500.55  -289.60       0.72139   0.02288   0.69215  -0.02895  0.99958  -0.00286  -0.69192  -0.01797  0.72175
BJ01          -150.58   -627.60  -274.85       0.68326  -0.36116  -0.63460   0.46347  0.88610  -0.00527   0.56422  -0.29052  0.77282
Urbani        -152.91   -632.95  -270.13       0.67899  -0.35342  -0.64348   0.45769  0.88910  -0.00538   0.57402  -0.29086  0.76544
HKU-39849     -154.55   -622.44  (illegible)   0.65621  -0.34854  -0.66926   0.46481  0.88539  -0.00535   0.59442  -0.30757  0.74301
CUHK-W1       -150.87   -623.56  -276.66       0.67352  -0.35379  -0.64901   0.46128  0.88724  -0.00495   0.57758  -0.29604  0.76077
CUHK-Su10     -150.06   -626.75  -278.13       0.67051  -0.34988  -0.65422   0.45881  0.88852  -0.00496   0.58302  -0.29684  0.75629
SIN2500       -148.99   -627.23  -277.78       0.67073  -0.35031  -0.65377   0.45932  0.88826  -0.00472   0.58237  -0.29712  0.75668
SIN2677       -148.91   -629.56  -278.38       0.66552  -0.34716  -0.66073   0.45863  0.88862  -0.00494   0.58885  -0.29974  0.75061
SIN2679       -148.22   -627.18  -277.99       0.66644  -0.34686  -0.65995   0.45800  0.88894  -0.00471   0.58829  -0.29912  0.75129
SIN2748       -149.34   -626.83  -277.39       0.66698  -0.34948  -0.65803   0.46041  0.88769  -0.00479   0.58580  -0.29977  0.75298
SIN2774       -148.27   -627.13  -277.97       0.66611  -0.34675  -0.66035   0.45810  0.88889  -0.00467   0.58859  -0.29939  0.75095
TW1           -152.93   -632.21  -272.52       0.66820  -0.34900  -0.65705   0.45918  0.88833  -0.00488   0.58538  -0.29844  0.75383
TOR2          -152.69   -630.34  -271.00       0.67042  -0.34999  -0.65425   0.45892  0.88846  -0.00502   0.58303  -0.29688  0.75627

(a) Refer to the text for a detailed explanation of the mathematical symbols used in this table.

(i) The parameters of the Z-curve for each genome. Based on the Z-curve, any genome can be represented
by a three-dimensional space curve composed of N nodes, one for every base position, with coordinates
$x_n$, $y_n$, $z'_n$, where n = 1, 2, ..., N (Figs. 1A-C). To describe its characteristics, we calculate
the following parameters. The first is the geometric center of all the N nodes,

$$
\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n, \qquad
\bar{y} = \frac{1}{N}\sum_{n=1}^{N} y_n, \qquad
\bar{z}' = \frac{1}{N}\sum_{n=1}^{N} z'_n.
\tag{4}
$$

Consequently, we obtain $(\bar{x}, \bar{y}, \bar{z}')$ for each genome. Refer to Table 2 for details.

Then the covariance matrix, which describes the global distribution pattern of the three-dimensional
space curve, is calculated as follows:

$$
\Gamma =
\begin{pmatrix}
\sigma_{xx} & \sigma_{xy} & \sigma_{xz'} \\
\sigma_{yx} & \sigma_{yy} & \sigma_{yz'} \\
\sigma_{z'x} & \sigma_{z'y} & \sigma_{z'z'}
\end{pmatrix},
\tag{5}
$$

$$
\sigma_{pq} = \frac{1}{N}\sum_{n=1}^{N} (p_n - \bar{p})(q_n - \bar{q}),
\tag{6}
$$

where $p, q = x, y, z'$.

Obviously, the matrix is real, symmetric, and of size 3 x 3. Using a 3 x 3 matrix to represent a
three-dimensional Z-curve is a very rough approximation and entails considerable information loss.
The advantage, however, is that this approximation makes it possible to compare genomes of different
lengths: a 3 x 3 covariance matrix is uniquely derived from Eq. (6) for each given genome regardless of
its length. From a geometrical point of view, the distribution pattern is approximately reduced to a
three-dimensional ellipsoid. The direction of each principal axis of the ellipsoid is given by an
eigenvector, and its length is proportional to the square root of the associated eigenvalue. The
eigenvectors and their associated eigenvalues are defined as follows:

$$
\Gamma \mathbf{C}_k = \lambda_k \mathbf{C}_k, \qquad
\mathbf{C}_k = (C_{k1}, C_{k2}, C_{k3}), \qquad k = 1, 2, 3.
\tag{7}
$$

Corresponding to each eigenvalue $\lambda_k$ there is an eigenvector $\mathbf{C}_k$. With the eigenvalues
ordered as $\lambda_1 < \lambda_2 < \lambda_3$, the three eigenvectors are denoted by $\mathbf{C}_1$,
$\mathbf{C}_2$, and $\mathbf{C}_3$, respectively. The eigenvalues and the associated normalized
eigenvectors are readily obtained using the Jacobi algorithm. The geometric center and three eigenvectors
for each of the 24 genomes are obtained in the same way. Refer to Table 2 for details about the parameters.
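
The first stage can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the code used for Table 2: the covariance is normalized by 1/N, the eigen-decomposition uses a textbook cyclic Jacobi rotation scheme, and the names `center_and_covariance` and `jacobi_eigen` are hypothetical.

```python
import math


def center_and_covariance(points):
    """Geometric center (Eq. 4) and covariance matrix (Eqs. 5-6) of (x, y, z') nodes."""
    n = len(points)
    center = [sum(p[a] for p in points) / n for a in range(3)]
    cov = [[sum((p[a] - center[a]) * (p[b] - center[b]) for p in points) / n
            for b in range(3)] for a in range(3)]
    return center, cov


def _matmul(a, b):
    """Product of two small dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def jacobi_eigen(matrix, max_sweeps=50, rel_tol=1e-24):
    """Cyclic Jacobi rotations for a real symmetric matrix (Eq. 7).

    Returns (eigenvalues, eigenvectors) sorted by ascending eigenvalue;
    eigenvectors[k] is the unit eigenvector belonging to eigenvalues[k].
    """
    size = len(matrix)
    a = [list(map(float, row)) for row in matrix]
    v = [[float(i == j) for j in range(size)] for i in range(size)]
    total = sum(val * val for row in a for val in row)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(size) for j in range(size) if i != j)
        if off <= rel_tol * total:
            break  # off-diagonal part is negligible: a is numerically diagonal
        for p in range(size - 1):
            for q in range(p + 1, size):
                if a[p][q] == 0.0:
                    continue
                # Rotation with |angle| <= pi/4 that zeroes the element a[p][q].
                tau = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.sqrt(tau * tau + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                rot = [[float(i == j) for j in range(size)] for i in range(size)]
                rot[p][p] = rot[q][q] = c
                rot[p][q], rot[q][p] = s, -s
                rot_t = [list(col) for col in zip(*rot)]
                a = _matmul(rot_t, _matmul(a, rot))  # similarity transform
                v = _matmul(v, rot)                  # accumulate eigenvectors as columns
    pairs = sorted((a[k][k], [v[r][k] for r in range(size)]) for k in range(size))
    return [lam for lam, _ in pairs], [vec for _, vec in pairs]
```

Feeding the list of nodes `(x_n, y_n, z'_n)` to `center_and_covariance` and the resulting matrix to `jacobi_eigen` gives one center and three unit eigenvectors per genome. The sign of each eigenvector is arbitrary at this point and has to be fixed by a convention before angles are compared.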

(ii) The distance matrix derived from the above parameters. In this paper, the Euclidean distance is used
to reflect the difference between two points,

$$
d_{ij} = \sqrt{(\bar{x}_i - \bar{x}_j)^2 + (\bar{y}_i - \bar{y}_j)^2 + (\bar{z}'_i - \bar{z}'_j)^2},
\qquad i, j = 1, 2, \ldots, M,
\tag{8}
$$

where $d_{ij}$ denotes the distance between the geometric centers of the ith and the jth genomes, and M is
the total number of genomes (M = 24 here). We thus obtain a real symmetric M x M matrix whose elements are
$d_{ij}$. To reflect the differences between the trends of every two three-dimensional curves, the angles
between the corresponding eigenvectors of every two genomes are used. The three-dimensional vectors are
denoted as follows:

$$
\mathbf{C}^i_k = (C^i_{k1}, C^i_{k2}, C^i_{k3}), \qquad i = 1, 2, \ldots, M, \quad k = 1, 2, 3,
\tag{9}
$$

where $\mathbf{C}^i_k$ is the kth vector of the ith genome. Each genome has three such eigenvectors.
According to their projections on the three axes, the vectors can be divided into three groups, which are
drawn with arrows of different styles (refer to Fig. 2A). They are clearly separated by their distribution
in space. The dark group (X group) has the greatest projections on the x-axis, while the vectors drawn
with dotted (Y group) and grey (Z' group) arrows have the greatest projections on the y-axis and the
z'-axis, respectively. Each genome therefore contributes exactly one of its three vectors to each group.

The three groups of eigenvectors obtained in this way are denoted by $\mathbf{C}_x$, $\mathbf{C}_y$, and
$\mathbf{C}_{z'}$, respectively (see Table 2). The cosine of the angle between any two vectors in a given
group is computed as follows:

$$
\cos\theta^k_{ij} = \frac{\mathbf{C}^i_k \cdot \mathbf{C}^j_k}{|\mathbf{C}^i_k|\,|\mathbf{C}^j_k|},
\qquad i, j = 1, 2, \ldots, M, \quad k = x, y, z'.
\tag{10}
$$

Repeating this procedure for all three groups, we obtain three real symmetric M x M matrices. These
matrices are then converted into angles, whose elements are as follows:

$$
\theta^k_{ij} = \arccos\left(\cos\theta^k_{ij}\right),
\qquad i, j = 1, 2, \ldots, M, \quad k = x, y, z'.
\tag{11}
$$


The sum of $\theta^k_{ij}$ over $k$ for given $i$, $j$ can be used to reflect the trend information of the
eigenvectors involved:

$$
\theta_{ij} = \sum_{k = x, y, z'} \theta^k_{ij}, \qquad i, j = 1, 2, \ldots, M.
\tag{12}
$$


Fig. 2. The three groups of eigenvectors (denoted with different arrows). The vectors in the X, Y, and Z'
groups are denoted by dark, dotted, and grey arrows, respectively. (A and B) The eigenvectors of the
24 genomes observed from different directions. It can be seen from (A) that the three groups can be
separated according to their three-dimensional space distribution. (B) The vectors in the Y group of the
24 genomes are coplanar and lie almost in the x-y plane.

Consequently, two sets of parameters are obtained. The first reflects the difference in center positions,
represented by the Euclidean distance between the geometric centers. The second indicates the difference
in the trends of the Z-curves, represented by the related eigenvectors. The overall distance $D_{ij}$
between species $i$ and $j$ is defined by

$$
D_{ij} = d_{ij} \times \theta_{ij}, \qquad i, j = 1, 2, \ldots, M.
\tag{13}
$$
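
The second stage, Eqs. (8)-(13), can be sketched as follows. The function names are hypothetical, and one convention is an assumption rather than something stated in the text: each eigenvector is assigned to the axis on which its projection is largest and is flipped, if necessary, so that this dominant component is positive, which matches the positive $C_{xx}$, $C_{yy}$, and $C_{z'z'}$ entries of Table 2.

```python
import math


def orient_and_group(eigenvectors):
    """Return the eigenvectors ordered as [X group, Y group, Z' group].

    Each vector goes to the axis of its largest absolute component and is
    sign-flipped so that this component is positive (assumed convention).
    """
    groups = {}
    for vec in eigenvectors:
        axis = max(range(3), key=lambda a: abs(vec[a]))
        sign = 1.0 if vec[axis] >= 0 else -1.0
        groups[axis] = [sign * comp for comp in vec]
    if len(groups) != 3:
        raise ValueError("eigenvectors do not map one-to-one onto the three axes")
    return [groups[0], groups[1], groups[2]]


def angle(u, v):
    """Angle between two vectors, Eqs. (10)-(11)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding errors cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def overall_distance(center_i, vectors_i, center_j, vectors_j):
    """D_ij = d_ij * theta_ij for two genomes, Eqs. (8), (12), and (13)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(center_i, center_j)))
    theta = sum(angle(u, v) for u, v in zip(vectors_i, vectors_j))
    return d * theta
```

Because Eq. (13) is a product, $D_{ij}$ vanishes whenever either factor does: two curves with identical eigenvectors receive zero distance regardless of how far apart their centers lie. This follows directly from the definition and is worth keeping in mind when interpreting the matrix.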


(iii) Clustering. Accordingly, a real symmetric M x M matrix with elements $D_{ij}$ is obtained and used
to reflect the evolutionary distance between species $i$ and $j$. The clustering tree is constructed using
the UPGMA method in the PHYLIP package (http://evolution.genetics.washington.edu/phylip.html). The final
phylogenetic tree is drawn using the DRAWGRAM program in the PHYLIP package. The branch lengths are not
scaled according to the distances; only the topology of the tree is considered.
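
For readers without PHYLIP at hand, the UPGMA step can be illustrated with a short plain-Python stand-in. It is a sketch of the general UPGMA procedure (repeatedly merge the closest pair of clusters and average distances weighted by cluster size), not PHYLIP's implementation; the function name `upgma` and the nested-tuple output are choices made here for illustration, and only the topology is returned.

```python
def upgma(names, dist):
    """Build a UPGMA topology from a symmetric distance matrix.

    names: list of labels; dist: list of lists with dist[i][j] == dist[j][i].
    Returns nested tuples, e.g. (("A", "B"), ("C", "D")).
    """
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (names[i], 1) for i in range(len(names))}
    # pairwise distances keyed by (smaller id, larger id)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)            # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        del d[(a, b)]
        for c in clusters:
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            # size-weighted average = mean over all leaf pairs (UPGMA rule)
            d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree
```

With four labels where A-B and C-D are the closest pairs, `upgma` returns `(("A", "B"), ("C", "D"))`; applied to the 24 x 24 matrix of $D_{ij}$ it would give a topology to compare against Fig. 3, although tie-breaking may differ from PHYLIP.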


3. Results and discussion 
3.1. The three-dimensional Z-curve for a complete genome 


As mentioned above, one of the advantages of the Z-curve is its intuitiveness. The features of a genome
can be viewed intuitively regardless of how long the genome is. Therefore, global and local compositional
features of a genome can be grasped quickly in a perceivable form (Zhang et al., 2003). To give an
intuitive impression of the differences among the three-dimensional curves, we take the genomes of TOR2,
BJ01, and BCoV as examples. TOR2 and BJ01 are SARS-CoVs, whereas BCoV belongs to another group of
coronaviruses. From the coordinates and the trends in Figs. 1A-C, we can see that the Z-curves of TOR2
and BJ01 are almost the same, while that of BCoV is significantly different from both of them, indicating
that the former two have a close phylogenetic relationship, whereas the relationships between the former
two and the latter are more distant. Similarity of Z-curves implies a close evolutionary relationship
between the organisms involved (Zhang et al., 2003), and vice versa. This constitutes the basis of the
current algorithm.

The Z-curve is approximately described by the geometric center and eigenvectors, which indicate its center
position and trends, respectively (Fig. 1D). In Fig. 1D the three arrows represent the three eigenvectors,
and the point from which they start is the geometric center. The three eigenvectors of a given genome can
be assigned to three groups according to their relationships with the axes (refer to Fig. 2). The trends
of the Z-curves carry part of the information used to construct the phylogenetic tree, and some
interesting results are revealed by this figure. It can be seen from Fig. 2B that the vectors in the
Y group, which have the greatest projections on the positive y-axis, are almost perfectly coplanar and lie
almost in the x-y plane. As can be seen from the plot, the 24 vectors are almost superposed on each other
as a single vector. The phenomenon can also be seen from the data in Table 2: the absolute values of
$C^i_{yz'}$ (i = 1, 2, ..., M) are all smaller than 0.0059. That is to say, they all have very small
projections on the z'-axis and are confined to the x-y plane. The vectors in the X group and Z' group
(represented with black and grey arrows, respectively, in Fig. 2B) are also coplanar, in the x-z' plane,
though their coplanarity is not as good as that of the Y group.


3.2. Phylogenetic tree of coronaviruses 


As mentioned above, there are three groups of coro- 
naviruses. Group I includes HCoV-229E, TGEV, and 
PEDV and group II contains BCoV, BCoVL, BCoVM, 
BCoVQ, MHV, MHV2, MHVM, MHVP, etc. All the 
viruses in these two groups are mammalian viruses. 
Group III contains only avian viruses, of which only 
the genome of IBV has been completely sequenced. 
Many researchers have analyzed the phylogenetic rela- 
tionships among coronavirus genomes based on the 
3C-like proteinase, polymerase, the structural proteins 
S, E, M, and N, respectively (Marra et al., 2003; Rota 
et al., 2003). Their results indicated that SARS-CoVs 
are not closely related to any of the previously charac- 
terized coronaviruses and form a distinct group (group 
IV) within the genus Coronavirus (Marra et al., 2003; 
Rota et al., 2003). As shown in Fig. 3, four groups of coronaviruses can be seen in the phylogram. The
SARS-CoVs appear to cluster together and form a separate branch, which can be distinguished easily from
the other three groups of coronaviruses. IBV, belonging to group III, is situated on an independent
branch, whereas TGEV, PEDV, and HCoV-229E, which belong to group I, tend to cluster together. In another
branch, the group II coronaviruses, including BCoV, BCoVL, BCoVM, BCoVQ, MHV, MHV2, MHVM, and MHVP, tend
to cluster together. First, group I and group II, which are all mammalian viruses, cluster together to
form a larger group. Second, this group joins group III, which contains only avian viruses, to form a
still larger group. Finally, SARS-CoVs join them, resulting in the phylogenetic tree shown in Fig. 3. The
resulting monophyletic clusters agree perfectly with the established taxonomic groups. To validate the
current method, a set of random sequences was used as a control. We generated 100 random sequences meeting
the requirements of the method. Each time, a phylogenetic analysis was done using 25 sequences, comprising
one random sequence and the 24 genomes. Consequently, 100 phylogenetic trees were obtained. In 98 of the
100 trees, the random sequence formed a distinct group without disturbing the other four groups. Only two
of the random sequences disturbed the four groups, suggesting that the current method is robust to the
addition of a random sequence.


Fig. 3. The phylogenetic tree constructed with the current method. The result shows that four groups exist
in the genus Coronavirus. Note that group I (HCoV-229E, TGEV, and PEDV) and group II (BCoVM, BCoVL, BCoVQ,
BCoV, MHVM, MHV2, MHVP, and MHV) first cluster together to form a larger group. Second, this group joins
group III (IBV) to form a still larger group. Finally, SARS-CoVs join them, resulting in the phylogenetic
tree shown here. Also note that the resulting monophyletic clusters agree perfectly with the established
taxonomic groups.
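
The text does not specify how the random control sequences were generated, so the following is only one plausible sketch: independent draws of A, C, G, and T with adjustable base frequencies and a length comparable to the genomes in Table 1. The function names, the default length of 29,000 nt, and the uniform default frequencies are all assumptions made for illustration.

```python
import random


def random_sequence(length, rng, weights=(0.25, 0.25, 0.25, 0.25)):
    """Draw an independent random base at each position (weights for A, C, G, T)."""
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def make_controls(count=100, length=29_000, seed=0):
    """Generate `count` reproducible random control sequences."""
    rng = random.Random(seed)  # fixed seed so the control set can be regenerated
    return [random_sequence(length, rng) for _ in range(count)]
```

Each control sequence would then be added in turn to the 24 genomes, and the three-stage procedure repeated on the resulting 25 sequences.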


3.3. Comparison with the results of previous analyses 


Almost all of the previous analyses revealed that SARS-CoVs form a distinct group different from the
other three groups of coronaviruses. However, the question of how SARS-CoVs emerged so suddenly remains
open. Rota et al. and Marra et al. performed phylogenetic analyses based on sequence alignments of
different genes. The results indicated that SARS-CoVs belong to a new group, but the original group from
which SARS-CoVs were derived could not be determined (Rota et al., 2003). The detection of SARS-CoV-like
viruses in Himalayan palm civets and other small animals in a live retail market indicates a route of
interspecies transmission, although the natural reservoir is unknown. Virus infection was also detected
in humans working at the same market. All the animal isolates retain a special 29-nucleotide fragment,
which is not found in most human isolates (Guan et al., 2003). Stavrinides and Guttman performed a
phylogenetic analysis of the SARS virus replicase, surface spike, matrix, and nucleocapsid proteins. The
results support a mammalian-like origin for the replicase protein, an avian-like origin for the matrix
and nucleocapsid proteins, and a mammalian-avian mosaic origin for the host-determining spike protein.
They proposed that a recombination event between mammalian-like and avian-like parent viruses within the
S gene might have taken place (Stavrinides and Guttman, 2004). However, phylogenetic inference based on
genome content tends to place a recombinant outside the related genomes, as seen in Fig. 3. Therefore, we
emphasize that the method presented is very unlikely to be able to trace back evolutionary history such
as a recombination event.

The present method reflects the global characteristics of genomes because the whole genome is taken into
consideration. The phylogenetic tree (Fig. 3) suggests that the SARS-CoVs have followed an independent
evolutionary path after their divergence from the other coronaviruses. As can be seen from Fig. 3, the
distance between the SARS-CoVs and all the others is the greatest. We suppose that the precursor of
SARS-CoV may have existed in some hosts and developed separately for many years. Grigoriev found that the
mutational patterns in the SARS-CoV genome are strikingly different from those of the other coronaviruses
in terms of mutation rates (Grigoriev, 2004). Phylogenetic analysis based on codon usage patterns
suggested that SARS-CoV diverged far from all three known groups of coronaviruses (Gu et al., 2004).
The overall level of similarity between SARS-CoVs and the other coronaviruses is low (Rota et al., 2003).
We suppose that this is due to different evolutionary paths. The isolation of a SARS-CoV-like virus from
Himalayan palm civets indicates a route of interspecies transmission. We hypothesize that events such as
nucleotide deletions or mutations in some important genes of the precursor may have resulted in the
change of host range.


3.4. Merits of the current method 


Because viruses lack morphological features and undergo frequent gene exchange, it is highly valuable to
develop methods of molecular phylogeny for them. Phylogenetic analysis based on sequence alignments is
now well developed. Sequence alignments are always based on some special genes or some conserved
fragments (Saitou, 1996). Such analysis can be done at both the amino acid level and the nucleotide
level. To overcome the biases caused by individual genes or genome segments, it is valuable to develop
methods of molecular phylogenetic analysis based on whole genome sequences. Unlike the sequence alignment
method, the current method is a geometric approach based on measuring the differences between the
Z-curves of whole genomes, including coding and non-coding sequences. There is no need to search for
similar sequences. Probably the most remarkable advantage of the present method is its simplicity and
intuitiveness.

The increasing availability of complete genomes has cast doubt on, rather than added detail to, the
phylogenetic tree (Qi et al., 2004). Phylogenetic analysis based on sequence alignments is usually done
on the most conserved part of a gene. These fragments are usually coding sequences, especially the
sequences coding for catalytic sites or the core of proteins, because they tend to be more evolutionarily
conserved. As one virologist has put it, one cannot simply assume that a virus can be represented by its
polymerase (http://www.ncbi.nlm.nih.gov/ICTV/). A virus must be viewed as a whole. Non-coding sequences
also play an important role in the virus, as do the less conserved genes. In addition, analyses based on
different genes may lead to different results. By using complete genomes, one avoids having to choose
which genes to align. Methods based on the whole genome are therefore likely to be more objective.
Recently, a k-string composition approach was proposed to analyze prokaryote phylogeny based on the whole
proteome, and satisfactory results were obtained (Qi et al., 2004); however, such analysis must rely on
annotation information. In contrast, the complete genome sequence is the only input of the current
method; neither annotation information nor any adjustable parameters are needed. It is noteworthy that
the current method is performed automatically without any human intervention.

The Z-curve, which serves as the foundation of the present method, is a powerful tool for studying
complete genome sequences. The Z-curve contains all the information that the corresponding DNA sequence
carries. Many biologically meaningful characteristics of a genome can be observed from the corresponding
Z-curve, such as the replication origins and genomic islands of some bacterial and archaeal genomes
(Zhang et al., 2003). We can inspect a genome in an intuitive way regardless of gene content and gene
order, even when the sequences are of different lengths. If the Z-curves of two species show a similar
pattern even though the genomes have different lengths, one may infer that they are evolutionarily close
organisms, and vice versa. In this paper, we use the geometric center and the eigenvectors to describe
the pattern approximately. Although this is only a rough approximation and merely a first attempt to
apply the Z-curve method to phylogenetic analysis, the results obtained agree well with previous analyses.


3.5. Limitations of the current method 


This method is intended for analyzing the phylogeny of genomes that are closely related. The phylogenetic
analysis is based on the differences among the three-dimensional Z-curves. In this paper, the 24 genomes
under study all belong to the same genus, Coronavirus. Additionally, the differences in length among the
genomes are not very large. If the genomes under study are much more distantly related and differ
considerably in length, the present method may not work. Consequently, caution must be taken when using
the present method to study the phylogeny of organisms separated by large evolutionary distances. In
addition, unlike estimation based on the comparison of orthologous genes, the Z-curve approach is
sensitive to genome rearrangements: a single large-scale inversion can change the form of the Z-curve
drastically. Therefore, the method presented here is considerably limited in cases of genome
rearrangement. Furthermore, as mentioned above, the three-dimensional Z-curve is only approximately
depicted by a few parameters, namely the geometric center and the three associated eigenvectors, so a
considerable part of the information contained in the Z-curve is lost. It is reasonable to suppose that
the more information is extracted from the Z-curve, the more accurate the result will be. The current
method can therefore be improved if new and more effective algorithms are proposed to extract the
information contained in the Z-curves. In summary, although the present method has some advantages, it
is still at an early stage of development. It may not be applicable to some general cases, and its
applications are therefore considerably limited at present.
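
The sensitivity to inversions can be made concrete with a toy example. The sketch below models an inversion as replacing a segment by its reverse complement, which is an assumption made for illustration, and tracks only the x component of Eq. (1); the function names are hypothetical. Complementation swaps purines with pyrimidines, so every step of the x component inside the inverted segment changes sign.

```python
def x_component(seq):
    """Cumulative purine-minus-pyrimidine count x_n (first component of Eq. 1)."""
    x = [0]
    for base in seq:
        x.append(x[-1] + (1 if base in "AG" else -1))
    return x


def invert_segment(seq, start, end):
    """Replace seq[start:end] by its reverse complement (a simple inversion model)."""
    complement = str.maketrans("ACGT", "TGCA")
    return seq[:start] + seq[start:end].translate(complement)[::-1] + seq[end:]


genome = "A" * 6 + "C" * 6
inverted = invert_segment(genome, 0, 6)
# The x component ends at 0 for the original and at -12 after the inversion.
print(x_component(genome)[-1], x_component(inverted)[-1])  # prints: 0 -12
```

In this toy case half of the sequence is inverted and the end point of the x component moves by the full length of the sequence, which in turn shifts the geometric center and the covariance matrix derived from the curve.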


4. Conclusion 


A geometric approach for inferring phylogenetic relationships based on the Z-curves of complete genomes
is proposed in this article. Phylogenetic analysis of the 24 coronaviruses shows that SARS-CoVs belong to
a new cluster, named group IV, and this result is consistent with those of previous analyses. The method
leaves much room for improvement, because information can be extracted from the whole genome rather than
from individual genes. Despite its limitations, the current whole-genome-based geometric approach
represents a new direction for inferring phylogenetic relationships of organisms in the post-genome era.
However, the method is still at an early stage of development and its applications are considerably
limited at present.